Conversation

@noalimoy
Summary

This PR adds comprehensive Kubernetes deployment support for llm-katan, enabling multi-instance deployments with model aliasing capabilities.

Kubernetes Manifests (Kustomize-based)

  • Base deployment with security contexts and health probes
  • PersistentVolumeClaim (5Gi) for efficient model caching
  • Service (ClusterIP) exposing port 8000
  • Namespace isolation (llm-katan-system)
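A base Deployment along these lines ties the pieces together (an illustrative sketch: the image name, probe endpoint, and mount path are assumptions, not taken from this PR):

```yaml
# deploy/kubernetes/base/deployment.yaml (illustrative sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-katan
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-katan
  template:
    metadata:
      labels:
        app: llm-katan
    spec:
      securityContext:
        runAsNonRoot: true          # hardened pod-level security context
      containers:
        - name: llm-katan
          image: llm-katan:latest   # assumed image name
          ports:
            - containerPort: 8000   # matches the ClusterIP Service port
          readinessProbe:
            httpGet:
              path: /health         # assumed probe endpoint
              port: 8000
          volumeMounts:
            - name: models
              mountPath: /models    # assumed cache mount path
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: llm-katan-models
```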

Multi-Instance Support (Overlays)

  • gpt35 overlay: Serves gpt-3.5-turbo alias
  • claude overlay: Serves claude-3-haiku-20240307 alias
  • Isolated PVCs per instance (prevents ReadWriteOnce conflicts)
  • Common labels component for consistent resource labeling
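Each overlay only needs to rename resources and override the served alias; a sketch of what the gpt35 overlay's kustomization could look like (the patch mechanism and env var wiring shown here are assumptions):

```yaml
# deploy/kubernetes/overlays/gpt35/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: llm-katan-system
nameSuffix: -gpt35            # yields llm-katan-gpt35, llm-katan-models-gpt35, ...
resources:
  - ../../base
components:
  - ../../components/common   # shared labels component
patches:
  - target:
      kind: Deployment
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/env/-
        value:
          name: YLLM_SERVED_MODEL_NAME
          value: gpt-3.5-turbo
```

The claude overlay would be identical apart from the suffix and the alias value, which is what makes adding further aliases cheap.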

Model Caching Optimization

  • InitContainer (model-downloader) pre-downloads models to PVC
  • Smart caching: Skips download if model exists
  • Uses python:3.11-slim with `hf download` for a lightweight (~45 MB) init image
  • Main container starts instantly with cached model
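The caching initContainer can be sketched roughly as follows (the model repo, cache directory, and install step are assumptions for illustration):

```yaml
# Illustrative initContainer sketch for the base Deployment
initContainers:
  - name: model-downloader
    image: python:3.11-slim
    command: ["sh", "-c"]
    args:
      - |
        # Skip the download entirely when the PVC already holds the model
        if [ -d /models/Qwen--Qwen3-0.6B ]; then    # assumed cache directory
          echo "Model already cached, skipping download"
        else
          pip install --no-cache-dir "huggingface_hub[cli]" &&
          hf download Qwen/Qwen3-0.6B --local-dir /models/Qwen--Qwen3-0.6B
        fi
    volumeMounts:
      - name: models
        mountPath: /models
```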

Bug Fix (config.py)

  • Added YLLM_SERVED_MODEL_NAME environment variable support
  • Previously only worked via CLI arguments
  • Now enables Kubernetes env-based configuration
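The fix amounts to falling back to the environment when no CLI value is given; a minimal sketch of the idea (field names and defaults are illustrative, not the exact config.py layout):

```python
import os
from dataclasses import dataclass, field


@dataclass
class ServerConfig:
    """Illustrative config sketch; field names are assumptions."""

    model_name: str = "Qwen/Qwen3-0.6B"
    # Read the served alias from the environment when present, so Kubernetes
    # can configure it with an env var instead of CLI arguments.
    served_model_name: str = field(
        default_factory=lambda: os.environ.get("YLLM_SERVED_MODEL_NAME", "")
    )

    def __post_init__(self):
        # Fall back to the real model name when no alias is configured
        if not self.served_model_name:
            self.served_model_name = self.model_name
```

With this shape, `YLLM_SERVED_MODEL_NAME=gpt-3.5-turbo` in the pod spec is enough to change the name the API reports, no CLI flags required.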

Documentation

  • Comprehensive deployment guide (deploy/docs/README.md)
  • Architecture explanation (Pod structure, storage, networking)
  • Kind cluster setup examples
  • Troubleshooting section with common issues
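Deployment follows the usual Kustomize flow; a sketch of the kind of commands the guide covers (the cluster name and overlay path are assumptions):

```shell
# Create a local Kind cluster (name is illustrative)
kind create cluster --name llm-katan

# Apply an overlay with Kustomize support built into kubectl
kubectl apply -k e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35

# Watch the pods come up in the dedicated namespace
kubectl get pods -n llm-katan-system -w
```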

Test Results

Deployment Validation (Kind Cluster)

Resources Created:

  • Namespace: llm-katan-system
  • Deployments: llm-katan-gpt35, llm-katan-claude (both 1/1 Running)
  • Services: llm-katan-gpt35, llm-katan-claude (ClusterIP, port 8000)
  • PVCs: llm-katan-models-gpt35, llm-katan-models-claude (both 5Gi Bound)

API Validation:
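Against a port-forwarded service, validation would look something like this (hypothetical commands; the actual responses were not included here):

```shell
# Forward the gpt35 service locally (service name from the resources above)
kubectl port-forward -n llm-katan-system svc/llm-katan-gpt35 8000:8000 &

# The /v1/models endpoint should report the served alias (gpt-3.5-turbo),
# not the backing model
curl -s http://localhost:8000/v1/models
```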

Motivation

This implementation addresses the need for:

  • Cloud-native deployments: Production-ready Kubernetes manifests
  • Multi-instance testing: Run multiple model aliases simultaneously
  • Efficient resource usage: Model caching prevents redundant downloads
  • Testing flexibility: Easy overlay creation for new model aliases

The Kustomize structure enables:

  • Consistent base configuration
  • Environment-specific customization via overlays
  • Easy addition of new model aliases without base changes

Related issue: #278

- Add comprehensive Kustomize manifests (base + overlays for gpt35/claude)
- Implement initContainer for efficient model caching using PVC
- Fix config.py to read YLLM_SERVED_MODEL_NAME from environment variables
- Add deployment documentation with examples for Kind cluster / Minikube

This enables running multiple llm-katan instances in Kubernetes, each
serving different model aliases while sharing the same underlying model.
The overlays (gpt35, claude) demonstrate multi-instance deployments where
each instance exposes a different served model name (e.g., gpt-3.5-turbo,
claude-3-haiku-20240307) via the API.

The served model name now works via environment variables, enabling
Kubernetes deployments to expose different model names via the API.

Signed-off-by: Noa Limoy <[email protected]>
@netlify
netlify bot commented Nov 20, 2025

Deploy Preview for vllm-semantic-router ready!

🔨 Latest commit: 04e7542
🔍 Latest deploy log: https://app.netlify.com/projects/vllm-semantic-router/deploys/691fa77afa818c0008140a9c
😎 Deploy Preview: https://deploy-preview-710--vllm-semantic-router.netlify.app

@github-actions
👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 e2e-tests

Owners: @yossiovadia
Files changed:

  • e2e-tests/llm-katan/deploy/docs/README.md
  • e2e-tests/llm-katan/deploy/kubernetes/base/deployment.yaml
  • e2e-tests/llm-katan/deploy/kubernetes/base/kustomization.yaml
  • e2e-tests/llm-katan/deploy/kubernetes/base/namespace.yaml
  • e2e-tests/llm-katan/deploy/kubernetes/base/pvc.yaml
  • e2e-tests/llm-katan/deploy/kubernetes/base/service.yaml
  • e2e-tests/llm-katan/deploy/kubernetes/components/common/kustomization.yaml
  • e2e-tests/llm-katan/deploy/kubernetes/overlays/claude/kustomization.yaml
  • e2e-tests/llm-katan/deploy/kubernetes/overlays/gpt35/kustomization.yaml
  • e2e-tests/llm-katan/llm_katan/config.py

vLLM

🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.
